
Vanishing Gradients

If the activation function's gradient is less than one on average, gradients shrink toward zero as the number of layers increases. This kills the backpropagated signal, and the early layers stop learning.

Sigmoid's derivative is at most 0.25, so it's very prone to this problem. Tanh's derivative peaks at 1 (only at zero), so it still shrinks the gradient almost everywhere.
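
A rough numerical illustration (a minimal sketch in plain Python; the depth and sample inputs are arbitrary): chaining 20 sigmoid layers multiplies the backpropagated gradient by at most 0.25 per layer, which already pushes it below 1e-12.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # at most 0.25, reached at x = 0

depth = 20

# Best case for sigmoid: every layer multiplies the gradient by 0.25.
print(f"0.25**{depth} = {0.25**depth:.2e}")  # ~9.1e-13

# More realistic: pre-activations away from 0 shrink the factor further.
xs = [1.5, -2.0, 0.7, 3.0, -1.0] * (depth // 5)
prod = math.prod(sigmoid_grad(x) for x in xs)
print(f"product of sigmoid'(x) over {depth} sample layers: {prod:.2e}")
```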

It's solved by:

  • ReLU, whose derivative is a constant 1 for positive inputs.
  • Residual connections, which keep an identity path, y = x + f(x), so the gradient dy/dx = 1 + f'(x) contributes 1 even when the activation's gradient is 0 (see the sketch after this list).
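
A quick finite-difference check of that last point (a sketch with a hypothetical "dead" ReLU branch; the weight and input values are made up): even when the branch's local gradient is 0, the identity path keeps the overall gradient at 1.

```python
def f(x):
    # Residual branch whose local gradient can be 0: ReLU(w * x)
    w = -2.0                # with x > 0 this makes the branch "dead"
    return max(0.0, w * x)

def residual(x):
    return x + f(x)         # y = x + f(x)

def plain(x):
    return f(x)             # same branch, no skip connection

x, eps = 1.0, 1e-6
grad_plain = (plain(x + eps) - plain(x - eps)) / (2 * eps)
grad_res = (residual(x + eps) - residual(x - eps)) / (2 * eps)
print(f"plain branch gradient:  {grad_plain:.3f}")  # 0.000 -- signal is gone
print(f"residual path gradient: {grad_res:.3f}")    # 1.000 -- identity keeps it alive
```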